
Update env, cmake, GEOS_Util and MAPL releases in components.yaml & update README.md after decommissioning of SLES12 at NCCS #796

Merged
gmao-rreichle merged 11 commits into develop from feature/wjiang/cleanup_helsurface
Apr 2, 2025

Conversation

@mathomp4
Member

@mathomp4 mathomp4 commented Feb 26, 2025

This non-0-diff PR updates the components.yaml to approximately match that of GEOSgcm main as of 2025-Mar-19.

env:       v4.29.1 --> v4.36.0
cmake:     v3.52.0 --> v3.57.0
GEOS_Util: v2.1.3  --> v2.1.6
MAPL:      v2.50.1 --> v2.54.2

The non-0-diff changes are within "roundoff," and are caused by the newer compiler/baselibs version. Intel tests with standard optimization are 0-diff when bit shaving is not used.

Note that the ESMA_env, ESMA_cmake, and MAPL versions are slightly newer than those of GEOSgcm main, but per the respective release notes this should be 0-diff w.r.t. what is in GEOSgcm main (but not 0-diff w.r.t. what is on GEOSldas develop before this PR!).

The PR also updates README.md to reflect that SLES15 is now the only O/S on the NCCS Discover platform.

Earlier versions of this PR also included the helfsurface() optimization of GEOS-ESM/GMAO_Shared#348, which requires a newer Intel compiler but should be zero-diff (which is why it will be done in a separate PR).
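
The distinction between "0-diff" and "within roundoff" used in this description can be sketched as follows. This is only an illustration, not the comparison used by the LDAS regression scripts: the function names and the relative tolerance `rel_tol` are assumptions chosen for the example.

```python
import math
import struct


def float32_bits(x: float) -> int:
    """Return the raw 32-bit pattern of x stored as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify(reference, candidate, rel_tol=1e-6):
    """Classify two sequences of values as 'zero-diff', 'roundoff', or 'different'.

    'zero-diff' means every pair is bit-identical as float32.
    'roundoff' means every pair agrees within rel_tol (an assumed threshold).
    """
    if len(reference) != len(candidate):
        return "different"
    pairs = list(zip(reference, candidate))
    if all(float32_bits(a) == float32_bits(b) for a, b in pairs):
        return "zero-diff"
    if all(math.isclose(a, b, rel_tol=rel_tol) for a, b in pairs):
        return "roundoff"
    return "different"
```

A result of "roundoff" here only says that the values agree to within the chosen tolerance; what counts as acceptable roundoff for a given variable is a judgment the thread leaves to the reviewers.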

@biljanaorescanin
Contributor

biljanaorescanin commented Mar 11, 2025

Testing summary:
Almost all test comparisons fail, confirming that the PR is not zero-diff. All differences are roundoff.

  • The Baselibs change from GCC 12 to GCC 14 is the reason for all GNU failures.
  • The Intel failures come from the new compilers.

| Runtype | Clone | Build | Build Time | Model Run/Compare | Assim Run/Compare |
|---|---|---|---|---|---|
| conus | pass | pass | 13 min | pass/FAIL | -- / -- |
| global | -- | -- | -- | pass/FAIL | pass/pass |
| globalcs | -- | -- | -- | pass/FAIL | pass/pass |
| globalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| debugconus | -- | pass | 11 min | pass/pass | -- / -- |
| aggconus | -- | pass | 14 min | pass/FAIL | -- / -- |
| aggglobal | -- | -- | -- | pass/FAIL | pass/FAIL |
| aggglobalcs | -- | -- | -- | pass/FAIL | pass/FAIL |
| aggglobalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| gnuconus | pass | pass | 31 min | pass/FAIL | -- / -- |
| gnuglobal | -- | -- | -- | pass/FAIL | pass/FAIL |
| gnuglobalcs | -- | -- | -- | pass/FAIL | pass/FAIL |
| gnuglobalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| gnudebugconus | -- | pass | 20 min | pass/pass | -- / -- |

Note: Helfand is not used as an option during testing (we use Louis as the default), so for this PR the change to use the branch is trivially zero-diff.

@mathomp4
Member Author

mathomp4 commented Mar 11, 2025

Hmm. You are getting failures from the new Intel? Are these build-time or run-time? That is, did things crash or just get different answers?

@gmao-rreichle
Contributor

@biljanaorescanin, @mathomp4,

I'm a bit confused by this PR and think we need to separate the update of the environment and the Helfand update. Specifically:

  1. It is not correct that only Louis is used. The GLOBALCS/assim tests use Helfand:
LDAS_AGGGLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1
LDAS_GLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1
LDAS_GNUGLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1
  2. I'm surprised that the comparison passes for the GLOBAL/assim test but fails for the GLOBAL/model test (and similarly for other tests).

I think it would be best to remove the Helfand branch from the PR and examine the impact of the environment update in isolation.
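
A check like the grep output above could be scripted as in the sketch below. It assumes, based only on this comment, that `CHOOSEMOSFC: 1` in `LDAS.rc` selects Helfand; the helper names are hypothetical, and treating `#` as a comment marker is an assumption about the rc file syntax.

```python
def read_rc_value(rc_text: str, key: str):
    """Return the value of `key` from 'KEY: value' lines in rc_text, or None."""
    for line in rc_text.splitlines():
        # Assumption: '#' starts a comment in the rc file.
        stripped = line.split("#", 1)[0].strip()
        name, sep, value = stripped.partition(":")
        if sep and name.strip() == key:
            return value.strip()
    return None


def uses_helfand(rc_text: str) -> bool:
    """True if the rc text sets CHOOSEMOSFC to 1 (Helfand, per the comment above)."""
    return read_rc_value(rc_text, "CHOOSEMOSFC") == "1"
```

For example, `uses_helfand(open("LDAS.rc").read())` would report whether a given test configuration exercises the Helfand code path.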

@biljanaorescanin
Contributor

  1. I forgot CS uses Helfand.
  2. I'm running now without the helfand branch. I did that run a few weeks back, but I removed it and didn't leave a comment on the PR about the result.

@biljanaorescanin
Contributor

biljanaorescanin commented Mar 11, 2025

  1. Running without the helfand branch is zero-diff to running with the helfand branch for the Intel tests, confirming again that the helfand vectorization is zero-diff.
  2. You will see below that the GNU tests fail, and I think that is because the tests need the GCC 14 baselibs to pass.
    It may also be a Discover glitch. I'm not 100% sure now since the build failed.
    For my previous test summary, @mathomp4 changed his sandbox to GCC 14.
[Screenshot 2025-03-11 at 12:35:53 PM]

@gmao-rreichle
Contributor

gmao-rreichle commented Mar 11, 2025

Thanks, @biljanaorescanin. Here are my 2c:

> Running without the helfand branch is zero-diff to running with the helfand branch for the Intel tests, confirming again that the helfand vectorization is zero-diff.

This is great, but I still think we want this to be in a separate PR for clarity. When releases are made, the release doc is basically a collection of PR titles. Having a separate PR for the zero-diff helfsurface() optimization change makes it much easier to understand what was done when a few months have passed and nobody can remember off the top of their head. I edited the present PR accordingly. Once the present PR has been merged, we can test and merge the GMAO_Shared helfsurface() optimization PR GEOS-ESM/GMAO_Shared#348.

> You will see below that the GNU tests fail, and I think that is because the tests need the GCC 14 baselibs to pass. It may also be a Discover glitch. I'm not 100% sure now since the build failed.
> For my previous test summary, @mathomp4 changed his sandbox to GCC 14.

What does it take to include the GCC-14 change into this PR? It doesn't make sense to me to merge this PR when it doesn't work with the current GNU version. Maybe I'm missing something.

Also, I'm still surprised that the comparison passes for the GLOBAL/assim test but fails for the GLOBAL/model test (and similarly for other tests). This could be a difference in 1d (tile) vs. 2d output and MAPL HISTORY regridding. Before we can merge the PR, we need to understand better what exactly is not zero-diff here.

@gmao-rreichle gmao-rreichle changed the title from "WIP: Testing helfsurface update in GEOS_Util" to "WIP: Update components.yaml to match GEOSgcm main as of 2025-Feb-26" on Mar 11, 2025
@biljanaorescanin
Contributor

If I focus only on Intel, you will see that only the NC4 files fail, and the differences are roundoff:
conus/cmp_model.conus.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
global/cmp_model.global.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
globalcnclm4/cmp_model.globalcnclm4.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
globalcs/cmp_model.globalcs.log: Comparing outputs failed! BIN: True, NC4: False, RST: True
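
Log lines in this form can be summarized with a small parser, sketched below. It reads `True` as "comparison passed" and `False` as "comparison failed" for each file type, which matches the statement that only the NC4 files fail; the function name `failed_kinds` is hypothetical and not part of the regression scripts.

```python
import re

# Matches status pairs such as "BIN: True" or "NC4: False".
_STATUS = re.compile(r"(BIN|NC4|RST):\s*(True|False)")


def failed_kinds(log_line: str) -> list[str]:
    """Return the file types whose comparison is reported as False in log_line."""
    return [kind for kind, ok in _STATUS.findall(log_line) if ok == "False"]
```

Applied to each `cmp_model.*.log` line, this makes it easy to confirm that the binary and restart comparisons pass and only the NC4 comparison fails.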

@mathomp4
Member Author

> Thanks, @biljanaorescanin. Here are my 2c:
>
> > You will see below that the GNU tests fail, and I think that is because the tests need the GCC 14 baselibs to pass. It may also be a Discover glitch. I'm not 100% sure now since the build failed.
> > For my previous test summary, @mathomp4 changed his sandbox to GCC 14.
>
> What does it take to include the GCC-14 change into this PR? It doesn't make sense to me to merge this PR when it doesn't work with the current GNU version. Maybe I'm missing something.

This is actually an issue with the scripting. In the regression scripts, for GNU runs I have to replace the g5_modules with one for GNU. At the moment, the LDAS scripting still uses a GCC 12 g5_modules (since moving to GCC 14 would be non-zero-diff).

I can change that in the scripting, and then the GNU tests would go NZD the next time things run.

@gmao-rreichle gmao-rreichle changed the title from "WIP: Update components.yaml to match GEOSgcm main as of 2025-Feb-26" to "Update env, cmake, GEOS_Util and MAPL releases in components.yaml & update README.md after decommissioning of SLES12 at NCCS" on Mar 19, 2025
tick up minor release, should be 0-diff per respective release notes
Contributor

@gmao-rreichle gmao-rreichle left a comment

See inline comments below.

local: ./@env
remote: ../ESMA_env.git
- tag: v4.29.1
+ tag: v4.36.0
Contributor

@biljanaorescanin, @mathomp4 : I ticked up the versions of env, cmake, and MAPL (c1007c8). Based on the documentation of the respective releases, this should be zero-diff w.r.t. what was on the PR before my latest edits (but definitely non-0-diff w.r.t. current develop). @mathomp4, please let me know if you have any objections or suggestions. @biljanaorescanin, when you get a chance, please re-test the PR. If all is as expected, the new test is 0-diff w.r.t. the most recent test (if you still have a copy).

@biljanaorescanin
Contributor

Tests are zero-diff relative to the previous iteration of testing.
We don't have GNU results because I didn't use the right baselibs, which needs to be changed on Matt's side. We do have the earlier results if anyone wants to take a look; the differences were roundoff.

@gmao-rreichle
Contributor

@mathomp4, @biljanaorescanin, @weiyuan-jiang:

I am still trying to understand the very unusual non-0-diff character of this PR. Specifically, the LDAS_GLOBAL/model test fails the comparison for the nc4 files (in just a small subset of variables, and within what seems to be roundoff). The curious thing is that the LDAS_GLOBAL/assim test passes!

If there was any change in the science code (or a roundoff change in the science calcs triggered by the newer env/baselibs), then the assim test should also fail the nc4 comparison. The fact that the assim test passes suggests that it's something in MAPL and/or the LDAS tile_bin2nc4 utility. That is, the variables would need to be 0-diff when they are in memory during the simulation, but then something changes when the data are written out.

I noticed that the Intel tests with standard optimization that do pass have no bit shaving in HISTORY.rc, whereas the tests that fail the nc4 comparison have bit shaving enabled. I went through the documentation of the MAPL releases between 2.50.1 and 2.54.2 and didn't notice anything that might have impacted the bit shaving, and the documentation suggests that for the most part the MAPL releases in question should all be 0-diff among themselves (for the GCM, which is usually a bigger hurdle than LDAS when it comes to 0-diff). So I can't see how exactly the bit shaving might cause the non-0-diffs seen here, but I also can't quite rule it out.

Thoughts?

@weiyuan-jiang
Contributor

weiyuan-jiang commented Mar 21, 2025

The failed comparison in model run is on files that are not in the assim run. So we probably don't need to worry about this. What happened to GCM's history output with bit shaving? @mathomp4

@gmao-rreichle
Contributor

> The failed comparison in model run is on files that are not in the assim run. So we probably don't need to worry about this.

@weiyuan-jiang, I'm not sure I understand the reasoning. Of course the "model" test case has different HISTORY output. What I'm after is understanding which exact changes in the PR caused the non-0-diff result for the model test case. Normally, anything that causes non-0-diff in the model test case would also cause non-0-diff in the assim test case. The fact that the model case is non-0-diff but the assim case is 0-diff is very unusual, and I'd really like to be able to explain this so we can make more informed decisions about how to interpret the non-0-diff changes in science applications going forward.

@gmao-rreichle
Contributor

For lack of a better idea I just tested a variant of the PR's branch that reverts MAPL back to 2.50.1. I only ran the Intel tests w/ standard optimization. The result is 0-diff w.r.t. using MAPL 2.54.2, so MAPL is not the cause of the non-0-diff result vs. develop. Still a mystery to me why we get non-0-diff for output from the "model" test but 0-diff for the "assim" test.

@biljanaorescanin
Contributor

In our regression testing, this test failed: global -- -- -- pass/FAIL pass/pass

If I run just the GLOBAL/model test and comment out *.nbits: 12 in HISTORY.rc, so that bit shaving is not used, then the develop vs. branch comparison is zero-diff for both collection comparisons.
In our GLOBAL/assim run we don't use bit shaving in HISTORY.rc, which is why that comparison passed.
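
For readers unfamiliar with bit shaving, the sketch below shows the general idea of discarding low-order mantissa bits of a 32-bit float before output. It is a generic illustration only and is not MAPL's implementation, which is not described in this thread; the function name and the reading of `nbits` as "mantissa bits kept" are assumptions.

```python
import struct


def shave_float32(value: float, keep_bits: int) -> float:
    """Zero all but the leading keep_bits of the 23-bit float32 mantissa."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    drop = 23 - keep_bits
    # Mask that clears the lowest `drop` bits and keeps sign, exponent,
    # and the leading mantissa bits.
    mask = (0xFFFFFFFF >> drop) << drop
    (shaved,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return shaved
```

With `keep_bits=12`, values that differ only below the twelfth mantissa bit are written as the same number, while the unshaved output keeps every bit. How the shaved output of the two builds ends up differing when the unshaved output is zero-diff is exactly the open question in this discussion, and this sketch does not answer it.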

@gmao-rreichle gmao-rreichle added the documentation Improvements or additions to documentation label Apr 2, 2025
@gmao-rreichle gmao-rreichle marked this pull request as ready for review April 2, 2025 16:28
@gmao-rreichle gmao-rreichle requested review from a team as code owners April 2, 2025 16:28
@gmao-rreichle gmao-rreichle merged commit 6455631 into develop Apr 2, 2025
11 checks passed
@gmao-rreichle gmao-rreichle deleted the feature/wjiang/cleanup_helsurface branch April 2, 2025 16:29

Labels

documentation (Improvements or additions to documentation), Not 0-diff


4 participants